This section details the various data sources used in the analysis, including their nature and relevance to the business task.
Excel:
We rely on Excel for the project’s initial stages, focusing on data exploration and basic analysis. Its user-friendly interface enables us to quickly glean insights and create preliminary visualizations, providing an initial grasp of the dataset.
R:
In the more advanced phases, we turn to R for critical tasks such as data merging, cleaning, transformation, and in-depth analysis. Leveraging its robust statistical capabilities and extensive libraries allows us to conduct comprehensive exploration and visualization, ensuring a thorough understanding of the data.
Tableau:
Strategically, we employ Tableau for crafting advanced visualizations tailored to address specific questions. Its dynamic and interactive features enhance the presentation of complex patterns, offering a deeper layer of exploration for stakeholders seeking nuanced insights.
PowerPoint:
PowerPoint becomes the final destination for presenting our key findings and insights. It transforms analytical results into a visually appealing and compelling narrative. PowerPoint’s capabilities allow us to create engaging presentations, ensuring effective communication of complex analyses to diverse audiences.
By harnessing the strengths of Excel, R, Tableau, and PowerPoint, this comprehensive toolset ensures a seamless workflow from initial data exploration to advanced analysis, sophisticated visualization, and impactful presentation.
The data is publicly available on this link: https://Cyclistic-tripdata.s3.amazonaws.com/index.html. This link contains monthly trip data files in CSV format from June 2013 to Jan 2024. The data is owned by the City of Chicago and licensed by Bikeshare, the operator of the Cyclistic bicycle sharing service.
The data consists of two types of files: Trip data and Station data. The trip data contains information about each bike trip taken by Cyclistic users, such as trip id, start time, end time, bike id, and trip duration. The station data contains information about each bike station in the Cyclistic network, such as name, latitude, longitude, capacity, and online date.
The data is divided into 64 folders, each containing data for a specific period of time. The folder names indicate the year, the month, or the quarter of the data. For example, the folder “Cyclistic_Stations_Trips_2014_Q1Q2” contains data for the first and second quarters of 2014, and the folder “202312-Cyclistic-tripdata” contains data for December 2023.
Each folder has one or two CSV files (trips and stations). Some contain data for a single month; for example, the file “202211-Cyclistic-tripdata.csv” contains data for November 2022. Others, like “Cyclistic_Trips_2017_Q1.csv”, contain data for the first quarter of 2017.
The station data is not available for all the folders. Only the folders that contain data for 2013, 2014, and 2015 have a separate CSV file for the station data. The station data for the other years is not provided in the data source.
The trip data is chronologically organized by minute, hour, day, month, and year, with each observation referring to a unique trip. Each trip has the following features:
trip_id: Integer (unique identifier for each trip)
start_time: DateTime (the date and time when the trip started)
end_time: DateTime (the date and time when the trip ended)
bikeid: Integer (unique identifier for each bike)
tripduration: Integer (duration of the trip in seconds)
from_station_id: Integer (ID of the station where the trip originated)
from_station_name: String (name of the station where the trip originated)
to_station_id: Integer (ID of the station where the trip ended)
to_station_name: String (name of the station where the trip ended)
usertype: String (type of user, either “Subscriber” or “Customer”)
gender: String (gender of the user, either “Male”, “Female”, or blank)
birthyear: Integer (birth year of the user, or blank)
rideable_type: String (type of bike, either “classic” or “electric”)
start_lat: Float (latitude of the start location)
start_lng: Float (longitude of the start location)
end_lat: Float (latitude of the end location)
end_lng: Float (longitude of the end location)
Note: Some of these features may not be provided in some tables at all, depending on the year and the month of the data. The features that may be missing are: bikeid, rideable_type, tripduration, start_lat, start_lng, end_lat, end_lng, gender, and birthyear.
On the other hand, the station data is geographically organized by latitude and longitude, with each row containing information for a specific station. Each station has the following features:
name: the name of the station
latitude: the latitude of the station location
longitude: the longitude of the station location
dpcapacity: the number of total docks at each station as of the date of the file online
date: the date the station went live in the system
This table summarizes the structure of all the datasets in three ways: content, timeframe, and features.
To address licensing, we are following the terms and conditions of the Data License Agreement provided by Bikeshare at this link: https://Cyclisticbikes.com/data-license-agreement. This agreement grants us a non-exclusive, royalty-free, limited, perpetual license to access, reproduce, analyze, copy, modify, and distribute the data in our product or service, and to use it for any lawful purpose. However, it also prohibits certain uses, such as using the data in an unlawful manner, hosting or selling the data as a stand-alone dataset, accessing the data by unauthorized means, circumventing any access restrictions, using data mining or extraction methods, attempting to correlate the data with personal information of users, and stating or implying that we are affiliated with or endorsed by Bikeshare. We also respect the trademarks and trade names of Bikeshare, the City of Chicago, and any sponsor of the Cyclistic service, and do not use them without written permission. For the purpose of this case study, we refer to the company as Cyclistic.
To address privacy, we are protecting the personal information of Cyclistic users, such as their genders and birth years, and not disclosing or sharing it with anyone. We are also not attempting to identify or correlate the users with their names, addresses, or other information. We are only using the data for analytical purposes and not for any malicious or harmful purposes.
To address security, we are storing the data in a secure location, such as a cloud-based platform or a password-protected device, and not exposing it to unauthorized access or tampering. We are also using encryption and authentication methods to ensure the integrity and confidentiality of the data.
To address accessibility, we are making the data available and understandable to our intended audience, such as our stakeholders, clients, or customers, and not restricting or limiting their access to the data. We are also using appropriate formats and tools to present and visualize the data, such as tables, charts, graphs, or dashboards, and providing clear and concise explanations and interpretations of the data.
ROCCC stands for Reliable, Original, Comprehensive, Current, and Cited. It is a framework for evaluating the quality and credibility of data sources.
Reliable: The data is reliable if it is consistent, accurate, and trustworthy. To assess the reliability of the data, I would check the following:
The data source: The data is owned by the City of Chicago and licensed by Bikeshare, the operator of the Cyclistic bicycle sharing service. These are reputable and official entities that have an interest in providing accurate and transparent data to the public. Therefore, the data source is reliable.
The data collection method: The data is collected automatically by the Cyclistic system, which records the details of each bike trip taken by the users. The data collection method is reliable, as it minimizes the possibility of human error or manipulation.
The data quality: The data is mostly complete and consistent, with only a few missing or blank values. The data quality is reliable, as it does not have major errors or anomalies that could affect the analysis.
Original: The data is original if it is the primary source of information, and not derived or copied from another source. To assess the originality of the data, I would check the following:
The data provenance: The data is directly provided by the City of Chicago and Bikeshare, which are the original owners and providers of the data. The data provenance is original, as it does not involve any intermediaries or third parties that could alter or modify the data.
The data format: The data is provided in CSV format, which is a common and standard format for storing and sharing tabular data. The data format is original, as it does not require any conversion or transformation that could affect the data integrity.
Comprehensive: The data is comprehensive if it covers all the relevant aspects and dimensions of the topic or question. To assess the comprehensiveness of the data, I would check the following:
The data scope: The data covers all the bike trips taken by Cyclistic users from June 2013 to Jan 2024, in the city of Chicago and its surrounding areas. The data scope is comprehensive, as it provides a large and diverse sample of the bike-sharing service usage and behavior.
The data features: The data contains various features that describe each bike trip, such as the trip_id, start_time, end_time, bike_id, trip_duration, from_station_name, to_station_name, user_type, rideable_type, start_lat, start_lng, end_lat, and end_lng. The data features are comprehensive, as they provide a rich and detailed information about the bike trips and the users (but not user personal information)
Current: The data is current if it is up-to-date and relevant to the present time. To assess the currency of the data, I would check the following:
The data timeliness: The data is updated twice per year, and the latest available data is from Jan 2024. The data timeliness is current, as it reflects the most recent status and trends of the bike-sharing service.
The data frequency: For the last three years, the data has been provided monthly, which is a reasonable and appropriate frequency for the bike-sharing service. The data frequency is current, as it allows for timely and periodic analysis of the data.
Cited: The data is cited if it provides the source and the license of the data, and acknowledges the rights and responsibilities of the data users and providers. To assess the citation of the data, I would check the following:
The data source: The data source is clearly stated and linked on the data website, which is https://Cyclistic-tripdata.s3.amazonaws.com/index.html. The data source is cited, as it gives credit and recognition to the data owners and providers.
The data license: The data license is also clearly stated and linked on the data website, at https://Cyclisticbikes.com/data-license-agreement. The data license is cited, as it specifies the terms and conditions of the data usage and sharing.
In order to answer the business question of how annual members and casual riders use Cyclistic bikes differently, we will focus on the data from the last 12 months (2023 Q1 to 2023 Q4). This data is relevant and sufficient to capture the seasonal and contextual variations in the bike usage patterns. We will exclude the data from 2013 to 2022, as it is not necessary for our analysis and may introduce noise or bias.
This data subset provides features such as ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, and member_casual, which help us answer our question: How do annual members and casual riders use Cyclistic bikes differently?
By analyzing the data, we can compare and contrast the behavior and preferences of the two user types, such as:
How often they use Cyclistic bikes, and how long their trips are.
When they use Cyclistic bikes, and how seasonality or weather affects their usage.
Where they use Cyclistic bikes, and which stations or routes are most or least popular among them.
Why they use Cyclistic bikes, and what their motivations or goals are for using them.
By answering these questions, we can gain insights into the needs and wants of the users, and provide recommendations on how to increase the number of annual members, retain the existing ones, and improve overall customer satisfaction and loyalty.
Data Processing
There are some minor problems with the data that could affect the analysis and the results, but they can be fixed or avoided with some data cleaning and preparation steps. These problems relate to data completeness, data currency, and data types. Here are some examples:
Data currency: The datasets for the last three years (since 2020) lack some of the fields present in earlier datasets, such as bikeid, gender, and birthyear. This could make gender and age-group insights outdated or inaccurate, as the data may not reflect the current or recent status and trends of the bike-sharing service or its users. It could also affect the comparability and consistency of the data, since data from different years may have different structures or features.
Data types mismatch: Another issue observed in the dataset is the presence of data types mismatch. For example, certain columns like ‘started_at’ and ‘ended_at’ are currently stored as character strings instead of datetime objects. This discrepancy can hinder accurate time-based analysis, as operations such as sorting or calculating time intervals may not be performed correctly. Addressing this issue through proper conversion of data types will contribute to the accuracy and reliability of temporal analyses.
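A minimal base-R illustration of the interval problem, using hypothetical timestamps in the same format as started_at and ended_at (not rows from the dataset):

```r
# Timestamps stored as character strings cannot be used for date arithmetic
started <- "2023-01-10 15:37:36"
ended   <- "2023-01-10 15:46:05"

# After converting to POSIXct, time intervals behave as expected
started_dt <- as.POSIXct(started, tz = "UTC")
ended_dt   <- as.POSIXct(ended, tz = "UTC")
duration_minutes <- as.numeric(difftime(ended_dt, started_dt, units = "mins"))
round(duration_minutes, 2)  # 8.48
```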
This section documents the steps taken to clean and manipulate the data to ensure its accuracy and relevance to the analysis.
We’re working with data from 2023, split into 12 monthly files. Luckily, these files share an identical structure, including the available features, their names, their data types, and their possible values. This enables the seamless merging of all CSV files into a single, comprehensive dataset. Managing and cleaning a unified dataset proves more efficient than addressing each monthly file in isolation.
Once merged, our data cleaning process encompasses addressing data types, handling missing values, resolving duplicates, identifying outliers, and rectifying any other potential errors encountered during the merging process. Additionally, we will introduce new features essential for addressing the core business question.
This holistic approach ensures the data is not only consolidated but also refined for robust analysis and insightful decision-making.
Using R, we will merge all the 12 csv into a single one.
First, I need to prepare the R environment by installing and loading the necessary packages for the upcoming tasks:
# Install and load necessary packages
required_packages <- c("dplyr", "here", "knitr", "readr", "ggplot2", "leaflet", "gridExtra", "lubridate", "geosphere", "skimr")
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## here() starts at D:/cycliticsBikeShareAnalysis/Cyclistic_Analysis
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Warning: package 'geosphere' was built under R version 4.4.0
I moved all the necessary CSV files into one folder named “Subset_2023”. Now I’ll merge them using the dplyr’s functions. But first, I must change the working directory:
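The merge step itself can be sketched as follows. This is a minimal base-R sketch: the folder name “Subset_2023” comes from the text above, while the helper name merge_monthly_csvs is illustrative.

```r
# Merge every monthly CSV file in a folder into one data frame
merge_monthly_csvs <- function(folder) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  # Read each file and stack them row-wise; assumes identical columns in every file
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

# merged_data <- merge_monthly_csvs("Subset_2023")
```

An equivalent tidyverse route is readr::read_csv() on each file followed by dplyr::bind_rows(), which matches columns by name rather than by position.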
After running the code, we check the structure and summary of the merged data set to ensure everything is in order
str(merged_data)
## 'data.frame': 5719877 obs. of 13 variables:
## $ ride_id : chr "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
## $ rideable_type : chr "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2023-01-21 20:05:42" "2023-01-10 15:37:36" "2023-01-02 07:51:57" "2023-01-22 10:52:58" ...
## $ ended_at : chr "2023-01-21 20:16:33" "2023-01-10 15:46:05" "2023-01-02 08:05:11" "2023-01-22 11:01:44" ...
## $ start_station_name: chr "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
## $ start_station_id : chr "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
## $ end_station_name : chr "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
## $ end_station_id : chr "202480.0" "TA1308000002" "599" "TA1308000002" ...
## $ start_lat : num 41.9 41.8 42 41.8 41.8 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.8 42 41.8 41.8 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual : chr "member" "member" "casual" "member" ...
skimr::skim_without_charts(merged_data)
| Name | merged_data |
| Number of rows | 5719877 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5719877 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4823909 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4835702 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 875716 | 1593 | 0 |
| start_station_id | 0 | 1 | 0 | 35 | 875848 | 1517 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 929202 | 1598 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 929343 | 1521 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.63 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.94 | -87.66 | -87.64 | -87.63 | -87.46 |
| end_lat | 6990 | 1 | 41.90 | 0.05 | 0.00 | 41.88 | 41.90 | 41.93 | 42.18 |
| end_lng | 6990 | 1 | -87.65 | 0.07 | -88.16 | -87.66 | -87.64 | -87.63 | 0.00 |
In the dataset, two features, end_lat and end_lng, exhibit a consistent count of missing values: 6,990 records out of the 5,719,877 total entries. While it is common practice to consider removing records with missing values, in this case a deliberate decision was made not to discard entire records. The rationale is that the missing values in end_lat and end_lng appear to be independent of other features and do not necessarily indicate a data integrity issue. Removing entire records might lead to a significant loss of information, especially given the dataset size. Instead, strategies such as imputation or further investigation into the nature of the missing values will be explored to ensure a more nuanced and informed handling of the data.
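Missingness per column can be quantified quickly. Below is a base-R sketch on a small hypothetical frame; on the real merged_data, the same colSums(is.na(...)) call reports the 6,990 missing coordinates.

```r
# Count missing values in each column of a data frame
toy <- data.frame(
  end_lat = c(41.9, NA, 41.8),
  end_lng = c(-87.6, NA, -87.7),
  member_casual = c("member", "casual", "member")
)
colSums(is.na(toy))  # end_lat: 1, end_lng: 1, member_casual: 0
```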
Here, I’ll check whether there is any relationship between start_station_name and start_lat/start_lng. If there is one, it will help me fill the missing values.
lincoln_fullerton_coordinates <- merged_data %>%
filter(start_station_name == "Lincoln Ave & Fullerton Ave") %>%
select(start_station_name, start_lat, start_lng)
head(lincoln_fullerton_coordinates)
## start_station_name start_lat start_lng
## 1 Lincoln Ave & Fullerton Ave 41.92407 -87.64628
## 2 Lincoln Ave & Fullerton Ave 41.92427 -87.64634
## 3 Lincoln Ave & Fullerton Ave 41.92431 -87.64657
## 4 Lincoln Ave & Fullerton Ave 41.92422 -87.64643
## 5 Lincoln Ave & Fullerton Ave 41.92416 -87.64638
## 6 Lincoln Ave & Fullerton Ave 41.92419 -87.64626
unique_start_lat <- unique(lincoln_fullerton_coordinates$start_lat)
unique_start_lng <- unique(lincoln_fullerton_coordinates$start_lng)
# Print unique values for start_lat
print(length(unique_start_lat))
## [1] 3299
# Print unique values for start_lng
print(length(unique_start_lng))
## [1] 4238
It clearly appears that a single station has thousands of distinct start_lat and start_lng values, which indicates that these coordinates do not describe the station where the trip starts; they most likely record where the bike specifically started the ride. So it is nearly impossible to recover the exact missing values.
# Station coordinates
station_lat <- 41.925905
station_lng <- -87.64926
# Calculate mean latitude and longitude for all points
average_lat <- mean(lincoln_fullerton_coordinates$start_lat, na.rm = TRUE)
average_lng <- mean(lincoln_fullerton_coordinates$start_lng, na.rm = TRUE)
# Create a leaflet map
leaflet(data = lincoln_fullerton_coordinates) %>%
addTiles() %>%
addCircleMarkers(lng = ~start_lng, lat = ~start_lat, radius = 1, color = "orange",
label = ~paste("Trip\nLat: ", start_lat, "\nLng: ", start_lng)) %>%
addCircleMarkers(lng = station_lng, lat = station_lat, radius = 5, color = "green",
label = "Lincoln Ave & Fullerton Ave Station\nLat: 41.925905\nLng: -87.64926") %>%
addCircleMarkers(lng = average_lng, lat = average_lat, radius = 5, color = "blue",
label = ~paste("Average\nLat: ", round(average_lat, 6), "\nLng: ", round(average_lng, 6)))
A reasonable approach is to fill the missing end_lat and end_lng with the average of the available end_lat and end_lng from other trips that ended at the same station.
# Create a new data frame with averages for each end station
station_averages <- merged_data %>%
select(end_station_name, end_lat, end_lng) %>%
group_by(end_station_name) %>%
summarize(
avg_end_lat = mean(end_lat, na.rm = TRUE),
avg_end_lng = mean(end_lng, na.rm = TRUE)
)
head(station_averages)
## # A tibble: 6 × 3
## end_station_name avg_end_lat avg_end_lng
## <chr> <dbl> <dbl>
## 1 "" 41.9 -87.7
## 2 "2112 W Peterson Ave" 42.0 -87.7
## 3 "410" 41.9 -87.7
## 4 "63rd St Beach" 41.8 -87.6
## 5 "900 W Harrison St" 41.9 -87.6
## 6 "Aberdeen St & Jackson Blvd" 41.9 -87.7
# Merge the averages back into the original dataset for missing values
merged_data <- merged_data %>%
left_join(station_averages, by = "end_station_name") %>%
mutate(
end_lat = ifelse(is.na(end_lat), avg_end_lat, end_lat),
end_lng = ifelse(is.na(end_lng), avg_end_lng, end_lng)
) %>%
select(-avg_end_lat, -avg_end_lng)
skimr::skim_without_charts(merged_data)
| Name | merged_data |
| Number of rows | 5719877 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5719877 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4823909 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4835702 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 875716 | 1593 | 0 |
| start_station_id | 0 | 1 | 0 | 35 | 875848 | 1517 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 929202 | 1598 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 929343 | 1521 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.63 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.94 | -87.66 | -87.64 | -87.63 | -87.46 |
| end_lat | 0 | 1 | 41.90 | 0.05 | 0.00 | 41.88 | 41.90 | 41.93 | 42.18 |
| end_lng | 0 | 1 | -87.65 | 0.07 | -88.16 | -87.66 | -87.64 | -87.63 | 0.00 |
Now, let’s compare the locations of some trips started from the same station with the average location of all those trips and the station’s location.
harrison_st_coordinates <- merged_data %>%
filter(start_station_name == "900 W Harrison St") %>%
select(start_station_name, start_lat, start_lng)
# Station coordinates
station_lat <- 41.874754
station_lng <- -87.649807
# Calculate mean latitude and longitude for all points
average_lat <- mean(harrison_st_coordinates$start_lat, na.rm = TRUE)
average_lng <- mean(harrison_st_coordinates$start_lng, na.rm = TRUE)
# Create a leaflet map
leaflet(data = harrison_st_coordinates) %>%
addTiles() %>%
addCircleMarkers(lng = ~start_lng, lat = ~start_lat, radius = 1, color = "red",
label = ~paste("Trip\nLat: ", start_lat, "\nLng: ", start_lng)) %>%
addCircleMarkers(lng = station_lng, lat = station_lat, radius = 5, color = "black",
label = "900 W Harrison St Station\nLat: 41.874754\nLng: -87.649807") %>%
addCircleMarkers(lng = average_lng, lat = average_lat, radius = 5, color = "yellow",
label = ~paste("Average\nLat: ", round(average_lat, 6), "\nLng: ", round(average_lng, 6)))
Duplicates must be eliminated, keeping only the first occurrence. Let’s first check if there are any duplicates in the dataset.
# Identifying and removing duplicate records, keeping only the first occurrence
length_before <- nrow(merged_data)
merged_data <- merged_data %>%
distinct()
print(paste0("Num of Duplicates: ", length_before - nrow(merged_data)))
## [1] "Num of Duplicates: 0"
To guarantee the correctness of the data types in my dataset, I initiated a comprehensive evaluation of each column using the base str() function and the glimpse() function from the dplyr package. This allowed me to gain a detailed understanding of the data structure, including the current data types associated with each column.
# Gaining a brief overview of the dataset
glimpse(merged_data)
## Rows: 5,719,877
## Columns: 13
## $ ride_id <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
## $ rideable_type <chr> "electric_bike", "classic_bike", "electric_bike", "…
## $ started_at <chr> "2023-01-21 20:05:42", "2023-01-10 15:37:36", "2023…
## $ ended_at <chr> "2023-01-21 20:16:33", "2023-01-10 15:46:05", "2023…
## $ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
## $ start_station_id <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
## $ end_station_name <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
## $ end_station_id <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
## $ start_lat <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957, 4…
## $ start_lng <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.594…
## $ end_lat <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983, 4…
## $ end_lng <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.599…
## $ member_casual <chr> "member", "member", "casual", "member", "member", "…
Upon identifying columns requiring a data type conversion, I developed a targeted strategy for transformation. Leveraging the mutate() function from the dplyr package, I employed an appropriate conversion method for each column.
# Convert data types
merged_data <- merged_data %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type),
started_at = as.POSIXct(started_at),
ended_at = as.POSIXct(ended_at),
start_station_name = as.character(start_station_name),
start_station_id = as.character(start_station_id),
end_station_name = as.character(end_station_name),
end_station_id = as.character(end_station_id),
member_casual = as.factor(member_casual)
)
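A quick sanity check on the datetime conversion. This is a self-contained sketch with hypothetical sample strings in the dataset’s format: with an explicit format, as.POSIXct() turns unparseable strings into NA rather than erroring, so counting NAs flags malformed timestamps.

```r
# Hypothetical sample timestamps, mirroring the started_at format
sample_times <- c("2023-01-21 20:05:42", "2023-01-10 15:37:36", "not a date")

# Unparseable strings become NA when an explicit format is supplied
parsed <- as.POSIXct(sample_times, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

sum(is.na(parsed))  # 1, i.e. one malformed timestamp
```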
In this step, I enhanced the dataset by introducing several new features to extract valuable insights and improve the analytical capabilities. The following features were created:
duration_minutes (minutes):
hour (0 - 23):
day_of_week (1 - 7, where 1 is Sunday):
day_of_month (1 - 31):
trip_length_km (kilometers):
month (Jan - Dec):
trip_ID
Even though the data has its own ride_id, for optimization purposes I will create a new, simpler identifier that takes up less space.
These additional features aim to provide a more detailed and nuanced perspective on the biking patterns, facilitating deeper analysis and interpretation of the dataset.
# Time duration (in minutes)
merged_data$duration_minutes <- round(as.numeric(difftime(merged_data$ended_at, merged_data$started_at, units = "mins")))
# Hour (0 - 23)
merged_data$hour <- hour(merged_data$started_at)
# Day of week (1 - 7, where 1 is Sunday)
merged_data$day_of_week <- wday(merged_data$started_at, label = TRUE)
# Day of month (1 - 31)
merged_data$day_of_month <- day(merged_data$started_at)
# Length of the trip (in kilometers, assuming start and end coordinates are in decimal degrees)
merged_data$trip_length_km <- geosphere::distHaversine(
cbind(merged_data$start_lng, merged_data$start_lat),
cbind(merged_data$end_lng, merged_data$end_lat)
) / 1000
# Month (Jun - Dec)
merged_data$month <- month(merged_data$started_at, label = TRUE)
# Adding trip_ID column
merged_data$trip_ID <- seq_len(nrow(merged_data))
# Display the modified dataset
head(merged_data)
## ride_id rideable_type started_at ended_at
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## start_station_name start_station_id end_station_name
## 1 Lincoln Ave & Fullerton Ave TA1309000058 Hampden Ct & Diversey Ave
## 2 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St
## 3 Western Ave & Lunt Ave RP-005 Valli Produce - Evanston Plaza
## 4 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St
## 5 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St
## 6 Lakeview Ave & Fullerton Pkwy TA1309000019 Hampden Ct & Diversey Ave
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 202480.0 41.92407 -87.64628 41.93000 -87.64000 member
## 2 TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
## 3 599 42.00857 -87.69048 42.03974 -87.69941 casual
## 4 TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
## 5 TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
## 6 202480.0 41.92607 -87.63886 41.93000 -87.64000 member
## duration_minutes hour day_of_week day_of_month trip_length_km month trip_ID
## 1 11 20 Sat 21 0.8399817 Jan 1
## 2 8 15 Tue 10 1.2059229 Jan 2
## 3 13 7 Mon 2 3.5476523 Jan 3
## 4 9 10 Sun 22 1.2059229 Jan 4
## 5 15 13 Thu 12 1.2059229 Jan 5
## 6 3 7 Tue 31 0.4477086 Jan 6
# Create a box plot
ggplot(merged_data, aes(y = duration_minutes)) +
geom_boxplot() +
labs(y = "duration_minutes") +
theme_minimal()
negative_duration <- merged_data[merged_data$duration_minutes < 0, ]
print(nrow(negative_duration))
## [1] 128
I see some data points under 0, possibly because the start and end times were accidentally switched. An easy approach is to flip their sign or remove them; I prefer to keep them, because every data point is valuable.
# Flip the sign of negative durations (equivalent to taking the absolute value)
merged_data$duration_minutes <- abs(merged_data$duration_minutes)
# Create a box plot
ggplot(merged_data, aes(y = duration_minutes)) +
geom_boxplot() +
labs(y = "duration_minutes") +
theme_minimal()
# Create a box plot
ggplot(merged_data, aes(y = trip_length_km)) +
geom_boxplot() +
labs(y = "trip_length_km") +
theme_minimal()
Some of the data points have a very long trip length (about 10,000 km).
# Find rows with trip_length_km greater than a threshold (e.g., 100 km)
outliers <- merged_data$trip_length_km > 100
merged_data[outliers, ]
## ride_id rideable_type started_at ended_at
## 2149866 873D50153BBC0686 electric_bike 2023-06-15 12:38:05 2023-06-15 12:38:41
## 2159177 ADFF57D27B5BF9D2 classic_bike 2023-06-15 09:38:07 2023-06-15 09:42:57
## 3413676 43107577DF9B498D classic_bike 2023-08-21 18:43:22 2023-08-21 22:05:55
## start_station_name start_station_id
## 2149866 OH Charging Stx - Test OH Charging Stx - Test
## 2159177 State St & 54th St 646
## 3413676 Dearborn St & Erie St 13045
## end_station_name end_station_id start_lat start_lng
## 2149866 OH Charging Stx - Test OH Charging Stx - Test 41.86317 -87.67981
## 2159177 OH Charging Stx - Test OH Charging Stx - Test 41.79664 -87.62592
## 3413676 Stony Island Ave & 63rd St 653B 41.89399 -87.62932
## end_lat end_lng member_casual duration_minutes hour day_of_week
## 2149866 0 0 member 1 12 Thu
## 2159177 0 0 casual 5 9 Thu
## 3413676 0 0 casual 203 18 Mon
## day_of_month trip_length_km month trip_ID
## 2149866 15 9826.424 Jun 2149866
## 2159177 15 9821.754 Jun 2159177
## 3413676 21 9822.335 Aug 3413676
The reason for these very long trip lengths is that the destination coordinates, end_lng and end_lat, were wrongly entered as zeros. The point (0, 0) is known as Null Island, located in the Gulf of Guinea, which makes the calculated distance enormous.
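A quick check confirms the magnitude (a sketch; the coordinates below are just a representative downtown Chicago point, not taken from the data):

```r
library(geosphere)

# Haversine distance from downtown Chicago (approx. 41.88 N, 87.63 W)
# to Null Island (0, 0), in km. distHaversine expects c(lon, lat).
dist_to_null_island_km <- distHaversine(c(-87.63, 41.88), c(0, 0)) / 1000
dist_to_null_island_km
```

The result is roughly 9,800 km — the same magnitude as the outlier trip lengths shown above, which supports the Null Island explanation.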
A reasonable approach is to calculate the average end_lat and end_lng for each end_station_name, join these averages onto the original data by station name, and finally replace the zeros in end_lat and end_lng with the corresponding averages.
# Calculate the average end_lat and end_lng for each end_station_name
station_averages <- merged_data %>%
filter(end_lat != 0 & end_lng != 0) %>% # Exclude rows with zeros
group_by(end_station_name) %>%
summarize(avg_end_lat = mean(end_lat),
avg_end_lng = mean(end_lng))
# Join the average values with the original data and replace zeros with averages
merged_data <- merged_data %>%
left_join(station_averages, by = "end_station_name") %>%
mutate(end_lat = ifelse(end_lat == 0, avg_end_lat, end_lat),
end_lng = ifelse(end_lng == 0, avg_end_lng, end_lng)) %>%
select(-avg_end_lat, -avg_end_lng) # Remove the temporary columns
# Check if zeros are replaced
summary(merged_data$end_lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41.61 41.88 41.90 41.90 41.93 42.18
summary(merged_data$end_lng)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -88.16 -87.66 -87.64 -87.65 -87.63 -87.33
Let’s recalculate trip_length_km with the most recent changes.
# Recalculate trip_length_km (Haversine distance in km)
merged_data$trip_length_km <- geosphere::distHaversine(
cbind(merged_data$start_lng, merged_data$start_lat),
cbind(merged_data$end_lng, merged_data$end_lat)
) / 1000
# Create a box plot
ggplot(merged_data, aes(y = trip_length_km)) +
geom_boxplot() +
labs(y = "trip_length_km") +
theme_minimal()
At last, I removed all unnecessary features, keeping only those relevant to the business task:
trip_ID, rideable_type, member_casual, duration_minutes, hour, day_of_week, day_of_month, month, trip_length_km.
Doing so reduces the data frame size, making the analysis much faster and cutting lag and waiting time.
merged_data <- merged_data[, c("trip_ID", "rideable_type", "member_casual", "duration_minutes", "hour", "day_of_week", "day_of_month", "month", "trip_length_km")]
head(merged_data)
## trip_ID rideable_type member_casual duration_minutes hour day_of_week
## 1 1 electric_bike member 11 20 Sat
## 2 2 classic_bike member 8 15 Tue
## 3 3 electric_bike casual 13 7 Mon
## 4 4 classic_bike member 9 10 Sun
## 5 5 classic_bike member 15 13 Thu
## 6 6 electric_bike member 3 7 Tue
## day_of_month month trip_length_km
## 1 21 Jan 0.8399817
## 2 10 Jan 1.2059229
## 3 2 Jan 3.5476523
## 4 22 Jan 1.2059229
## 5 12 Jan 1.2059229
## 6 31 Jan 0.4477086
After a thorough cleaning process, the dataset stands as a reliable foundation for analysis. Trip durations were rounded to whole minutes to aid interpretability, and outlier analysis shows that the coordinate variables (start_lat, start_lng, end_lat, and end_lng) now follow a consistent, valid distribution. The zero values in end_lat and end_lng were imputed from per-station averages rather than dropping records wholesale, and duplicates were identified and removed. With appropriate data type conversions and the engineered features described above (such as duration_minutes and trip_length_km), the dataset is ready for analysis with no further cleaning required.
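These cleaning claims can be spot-checked programmatically. Below is a minimal sketch; the helper function is hypothetical, not part of the original pipeline:

```r
# Hypothetical sanity-check helper for the cleaned data frame:
# it errors if any cleaning step failed, and returns TRUE otherwise.
check_clean <- function(df) {
  stopifnot(
    all(df$duration_minutes >= 0),                         # negative durations were flipped
    all(df$end_lat != 0 & df$end_lng != 0, na.rm = TRUE),  # Null Island coordinates imputed
    !any(duplicated(df$trip_ID))                           # trip IDs are unique
  )
  invisible(TRUE)
}
```

Usage would simply be `check_clean(merged_data)` after the steps above.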
Presenting supporting visualizations that aid in understanding the data, accompanied by key findings derived from the analysis.
glimpse(merged_data)
## Rows: 5,719,877
## Columns: 9
## $ trip_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ rideable_type <chr> "electric_bike", "classic_bike", "electric_bike", "cl…
## $ member_casual <fct> member, member, casual, member, member, member, membe…
## $ duration_minutes <dbl> 11, 8, 13, 9, 15, 3, 14, 9, 12, 13, 10, 9, 8, 10, 5, …
## $ hour <int> 20, 15, 7, 10, 13, 7, 21, 10, 20, 16, 17, 17, 19, 20,…
## $ day_of_week <ord> Sat, Tue, Mon, Sun, Thu, Tue, Sun, Wed, Wed, Fri, Thu…
## $ day_of_month <int> 21, 10, 2, 22, 12, 31, 15, 25, 25, 6, 5, 3, 9, 3, 12,…
## $ month <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
## $ trip_length_km <dbl> 0.8399817, 1.2059229, 3.5476523, 1.2059229, 1.2059229…
In my quest to differentiate between member and casual users, I have adopted a comprehensive approach by utilizing a combination of total, average, and count metrics. This strategic choice enables me to gain a nuanced understanding of various aspects of user behavior, ensuring a well-rounded analysis.
Total Metrics: Capturing Magnitude and Impact
The inclusion of total metrics, such as the total number of trips or total distance covered, allows us to comprehend the overall magnitude and cumulative impact of user activity. This approach provides insights into the absolute contribution of each user type, aiding in the assessment of their respective scales of engagement with the bike-sharing service.
Average Metrics: Unveiling Typical Behavior
Concurrently, we employ average metrics, such as average trip duration or average distance per trip, to reveal the typical behavior exhibited by each user type. Calculating averages normalizes the data, providing a balanced perspective that considers variations in dataset sizes. This is particularly valuable for understanding the standard or average experience of users, irrespective of the overall volume of their activity.
Count Metrics: Quantifying Frequency
In addition to total and average metrics, we incorporate count metrics to quantify the frequency of user interactions. Counting the number of trips or occurrences provides a straightforward measure of user engagement, complementing the insights derived from total and average metrics.
By strategically combining total, average, and count metrics, our analysis aims to uncover patterns, preferences, and distinctions between member and casual users. This multifaceted approach ensures a comprehensive exploration of user behavior, empowering us to address the overarching business question: “How do member and casual users differ in their utilization of our bike-sharing service?”
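All three metric families can be produced in a single dplyr summarise. The sketch below uses a hypothetical three-row sample in place of merged_data, just to show the shape of the computation:

```r
library(dplyr)

# Hypothetical mini-sample standing in for merged_data
sample_trips <- data.frame(
  member_casual    = c("member", "member", "casual"),
  duration_minutes = c(10, 20, 45)
)

user_metrics <- sample_trips %>%
  group_by(member_casual) %>%
  summarize(
    total_duration = sum(duration_minutes),   # total: magnitude and impact
    avg_duration   = mean(duration_minutes),  # average: typical behavior
    trip_count     = n(),                     # count: frequency
    .groups = "drop"
  )
```

The same pattern, applied to merged_data, drives the total/average/count comparisons in the sections that follow.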
skimr::skim_without_charts(merged_data)
| Name | merged_data |
| Number of rows | 5719877 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| member_casual | 0 | 1 | FALSE | 2 | mem: 3660698, cas: 2059179 |
| day_of_week | 0 | 1 | TRUE | 7 | Sat: 883566, Thu: 860202, Fri: 843524, Wed: 835625 |
| month | 0 | 1 | TRUE | 12 | Aug: 771693, Jul: 767650, Jun: 719618, Sep: 666371 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| trip_ID | 0 | 1 | 2859939.00 | 1651186.41 | 1 | 1429970.00 | 2859939.00 | 4289908.00 | 5719877.00 |
| duration_minutes | 0 | 1 | 18.20 | 180.84 | 0 | 5.00 | 10.00 | 17.00 | 98489.00 |
| hour | 0 | 1 | 14.09 | 4.94 | 0 | 11.00 | 15.00 | 18.00 | 23.00 |
| day_of_month | 0 | 1 | 15.54 | 8.74 | 1 | 8.00 | 15.00 | 23.00 | 31.00 |
| trip_length_km | 0 | 1 | 2.10 | 1.91 | 0 | 0.86 | 1.54 | 2.76 | 48.98 |
The final data has these features:
Numerical: duration_minutes, trip_length_km
Categorical: rideable_type, member_casual, hour, day_of_week, day_of_month, month
Identifier: trip_ID
All the categorical time features (hour, day_of_week, day_of_month, month) were extracted from the started_at datetime feature. I chose it over ended_at because we are only interested in when each trip started.
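For reference, this extraction can be sketched with lubridate (an assumption about how it was done; the timestamp below is a hypothetical started_at value):

```r
library(lubridate)

t <- ymd_hms("2023-01-21 20:05:00")  # hypothetical started_at value

hour(t)                 # hour of day (0-23)
wday(t, label = TRUE)   # weekday label (locale-dependent), here Sat
mday(t)                 # day of month (1-31)
month(t, label = TRUE)  # month label, here Jan
```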
The calculated trip_length_km feature measures the straight-line (Haversine) distance between the start and end coordinates of each trip. That is not the same as the distance actually ridden: a rider who starts and finishes at the same station after an hour on the bike gets a trip length of 0 km. So duration_minutes is the other factor we need to be aware of.
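A tiny illustration of that caveat, using the same Haversine formula as earlier (the two rows are hypothetical, not from the data):

```r
library(geosphere)

# Row 1 is a round trip: 60 minutes of riding, identical start/end coordinates.
toy <- data.frame(
  duration_minutes = c(60, 12),
  start_lat = c(41.90, 41.90), start_lng = c(-87.63, -87.63),
  end_lat   = c(41.90, 41.92), end_lng   = c(-87.63, -87.65)
)
toy$trip_length_km <- distHaversine(
  cbind(toy$start_lng, toy$start_lat),
  cbind(toy$end_lng, toy$end_lat)
) / 1000

toy$trip_length_km[1]  # 0 km despite a 60-minute ride
```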
ggplot(merged_data) + geom_point(mapping = aes(x = duration_minutes, y = trip_length_km))
Overall, trip_length_km is a good indicator that may give us some hints about the differences between casual users and annual members.
Q: How frequently do members and casual users book trips?
# Summarize total count of trips for each user type
user_trip_count <- merged_data %>%
group_by(member_casual) %>%
summarize(total_trips = n())
# Calculate percentage
user_trip_count$percentage <- user_trip_count$total_trips / sum(user_trip_count$total_trips) * 100
# Plot a pie chart
ggplot(user_trip_count, aes(x = "", y = total_trips, fill = member_casual)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "User Type Distribution",
x = NULL,
y = NULL,
fill = "User Type") +
theme_minimal() +
geom_text(aes(label = paste0(total_trips, " (", round(percentage, 1), "%)")),
position = position_stack(vjust = 0.5),
color = "black",
size = 3)
Based on the pie chart, annual members account for 64% of total trips, while casual riders account for the remaining 36%. This pronounced imbalance underscores the stronger engagement and more consistent utilization of the service by annual members compared to occasional riders.
Q: How does each user type use different bike types for their trips?
# Calculate average duration for each user/bike type
bike_type_avg_duration <- merged_data %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Calculate average trip length for each user/bike type
bike_type_avg_length <- merged_data %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_length = mean(trip_length_km, na.rm = TRUE))
# Calculate total duration for each user/bike type
bike_type_total_duration <- merged_data %>%
group_by(member_casual, rideable_type) %>%
summarize(total_duration = sum(duration_minutes))
# Calculate total trip length for each user/bike type
bike_type_total_length <- merged_data %>%
group_by(member_casual, rideable_type) %>%
summarize(total_length = sum(trip_length_km))
# Calculate distribution of trips for each user/bike type
bike_type_trip_distribution <- merged_data %>%
group_by(member_casual, rideable_type) %>%
summarize(trip_count = n())
# Plot all six graphs
plot1 <- ggplot(bike_type_avg_duration, aes(x = member_casual, y = avg_duration, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(avg_duration, 2)), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + # Add labels
labs(title = "Average Duration for Each User/Bike Type",
x = "User Type",
y = "Average Duration (minutes)",
fill = "Bike Type") +
theme_minimal()
plot2 <- ggplot(bike_type_avg_length, aes(x = member_casual, y = avg_length, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(avg_length, 2)), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + # Add labels
labs(title = "Average Trip Length for Each User/Bike Type",
x = "User Type",
y = "Average Trip Length (km)",
fill = "Bike Type") +
theme_minimal()
plot3 <- ggplot(bike_type_total_duration, aes(x = member_casual, y = total_duration, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = total_duration), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + # Add labels
labs(title = "Total Duration for Each User/Bike Type",
x = "User Type",
y = "Total Duration (minutes)",
fill = "Bike Type") +
theme_minimal()
plot4 <- ggplot(bike_type_total_length, aes(x = member_casual, y = total_length, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(total_length, 2)), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + # Add labels
labs(title = "Total Trip Length for Each User/Bike Type",
x = "User Type",
y = "Total Trip Length (km)",
fill = "Bike Type") +
theme_minimal()
plot5 <- ggplot(bike_type_trip_distribution, aes(x = member_casual, y = trip_count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = trip_count), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + # Add labels
labs(title = "Distribution of Trips for Each User/Bike Type",
x = "User Type",
y = "Trip Count",
fill = "Bike Type") +
theme_minimal()
# Arrange plots in a grid
grid.arrange(plot1, plot2, plot3, plot4, plot5, nrow = 3, ncol = 2)
In general, both user segments exhibit distinct preferences in bike usage, with a notable inclination towards electric bikes for long-distance trips and classic bikes for longer duration journeys. While casual users are the sole users of docked bikes, both user types employ classic and electric bikes differently. On average, casual users tend to utilize classic bikes for longer durations but cover shorter distances compared to annual members. Conversely, annual members favor electric bikes for both longer distances and durations relative to casual users. These preferences suggest a nuanced usage pattern influenced by user habits and needs. One assumption is that casual users may prioritize leisurely rides or short-distance errands, hence their preference for longer-duration trips on classic bikes. Conversely, annual members, likely using bikes for commuting or longer recreational outings, may opt for electric bikes to cover greater distances more efficiently. Additionally, factors such as commuting habits, accessibility to bike stations, and personal preferences may further influence these usage patterns.
Q: What is the average distance covered by each user type in their trips?
Q: What is the total distance covered by each user type in their trips?
# Summarize average distance for each user type
avg_distance_summary <- merged_data %>%
group_by(member_casual) %>%
summarize(avg_distance = mean(trip_length_km))
# Plot horizontal bar chart for average distance
plot1 <- ggplot(avg_distance_summary, aes(x = avg_distance, y = reorder(member_casual, -avg_distance), fill = member_casual)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(round(avg_distance, 2))), hjust = -0.1, color = "black", size = 3) + # Add labels with comma formatting
labs(title = "Average Distance for Each User Type",
x = "Average Distance (km)",
y = "User Type",
fill = "User Type") +
theme_minimal()
# Summarize total distance for each user type
total_distance_summary <- merged_data %>%
group_by(member_casual) %>%
summarize(total_distance = sum(trip_length_km))
# Plot horizontal bar chart for total distance
plot2 <- ggplot(total_distance_summary, aes(x = total_distance, y = reorder(member_casual, -total_distance), fill = member_casual)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(round(total_distance, 2))), hjust = 1.1, color = "black", size = 3) + # Add labels with comma formatting
labs(title = "Total Distance Covered by Each User Type",
x = "Total Distance (km)",
y = "User Type",
fill = "User Type") +
theme_minimal()
grid.arrange(plot1, plot2, nrow=2, ncol=1)
Insights:
The average trip distance for casual users (2.12 km) is nearly equal to that of members (2.10 km). However, members traveled a significantly greater total distance (7,672,741 km) than casual users (4,362,427 km). This suggests that members take more trips overall, even though their average trip distance is similar. Note that this graph shows only average and total distance; it does not account for the distribution of trip distances or the size of each user base.
Q: What is the average duration spent on trips by each user type?
Q: What is the total duration spent on trips by each user type?
# Calculate average time duration for each user type
user_avg_duration <- merged_data %>%
group_by(member_casual) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Plot the results
plot1 <- ggplot(user_avg_duration, aes(x = member_casual, y = avg_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(avg_duration, 2)), position = position_stack(vjust = 0.5), color = "black", size = 3) + # Add labels
labs(title = "Average Trip Duration for each User Type",
x = "User Type",
y = "Average Time Duration (minutes)",
fill = "User Type") +
theme_minimal()
# Summarize total time duration for each user type
user_total_duration <- merged_data %>%
group_by(member_casual) %>%
summarize(total_duration = sum(duration_minutes, na.rm = TRUE))
# Plot the results
plot2 <- ggplot(user_total_duration, aes(x = member_casual, y = total_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(round(total_duration, 2))), position = position_stack(vjust = 0.5), color = "black", size = 3) + # Add labels with comma formatting
labs(title = "Total Trip Duration for each User Type",
x = "User Type",
y = "Total Time Duration (minutes)",
fill = "User Type") +
theme_minimal()
grid.arrange(plot1, plot2, nrow=2, ncol=1)
Insights:
Q: How does each user type use bikes throughout the day (0 - 23)?
# Calculate average duration per hour for each user type
hourly_avg_duration <- merged_data %>%
group_by(hour, member_casual) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Plot the results
plot1 <- ggplot(hourly_avg_duration, aes(x = hour, y = avg_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Duration per Hour",
x = "Hour",
y = "Average Duration (minutes)",
fill = "User Type") +
theme_minimal()
# Calculate total duration per hour for each user type
hourly_total_duration <- merged_data %>%
group_by(hour, member_casual) %>%
summarize(total_duration = sum(duration_minutes, na.rm = TRUE))
# Plot the results
plot2 <- ggplot(hourly_total_duration, aes(x = hour, y = total_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Duration per Hour for Each User Type",
x = "Hour",
y = "Total Duration (minutes)",
fill = "User Type") +
theme_minimal()
# Calculate average distance per hour for each user type
hourly_avg_distance <- merged_data %>%
group_by(hour, member_casual) %>%
summarize(avg_distance = mean(trip_length_km, na.rm = TRUE))
# Plot the results
plot3 <- ggplot(hourly_avg_distance, aes(x = hour, y = avg_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Distance per Hour for Each User Type",
x = "Hour",
y = "Average Distance (km)",
fill = "User Type") +
theme_minimal()
# Calculate total distance per hour for each user type
hourly_total_distance <- merged_data %>%
group_by(hour, member_casual) %>%
summarize(total_distance = sum(trip_length_km, na.rm = TRUE))
# Plot the results
plot4 <- ggplot(hourly_total_distance, aes(x = hour, y = total_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Distance per Hour for Each User Type",
x = "Hour",
y = "Total Distance (km)",
fill = "User Type") +
theme_minimal()
# Calculate distribution of trips throughout the day for each user type
hourly_trip_distribution <- merged_data %>%
group_by(hour, member_casual) %>%
summarize(trip_count = n())
# Plot the results with a stacked area chart
plot5 <- ggplot(hourly_trip_distribution, aes(x = hour, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Distribution of Trips Throughout the Day",
x = "Hour",
y = "Trip Count",
fill = "User Type") +
theme_minimal()
grid.arrange(plot1, plot2, plot3, plot4, plot5, nrow=3, ncol=2)
Insights
Average Duration: Both user types exhibit relatively stable trip duration throughout the day, with a notable decrease observed between 5 and 7 AM. This dip in average duration may be attributed to the majority of trips during these hours being commuter journeys, characterized by shorter distances and duration.
Average Distance: Throughout the day, both user types cover approximately the same average distance, averaging around 2.0 kilometers per trip. This consistency suggests that regardless of the time of day, users tend to travel similar distances on average.
Total Duration: The distribution of total trip duration follows a bell-shaped curve for both user types, peaking around 3 PM and gradually decreasing thereafter. However, for annual members, there is a notable increase in total trip duration between 7 and 8 AM, corresponding to the morning rush hour commute. Conversely, a significant decrease is observed from 10 AM to 5 PM, indicating reduced bike usage during typical working hours.
Total Distance: Casual users exhibit a normally distributed pattern in total trip distance, peaking around 4 PM and declining gradually on either side. This trend suggests that casual users tend to engage in longer bike trips during the evening, possibly for leisure or recreational purposes. In contrast, annual members show two peaks in total trip length, occurring from 6 to 8 AM and again from 4 to 7 PM. This pattern aligns with commuting behavior, where morning and evening trips are more common for work-related travel.
Distribution of Trips: The distribution of trips throughout the day follows a similar pattern for both user types, with two distinct peaks observed at 7 AM and 5 PM. However, the peak at 5 PM is more pronounced, indicating a higher volume of trips during the evening hours. This dual peak suggests a combination of commuters returning home from work and individuals beginning leisure activities in the evening.
Q: How does each user type use bikes on each day of the week (1 - 7)?
# Calculate average duration for each weekday and user type
weekday_avg_duration_user <- merged_data %>%
group_by(day_of_week, member_casual) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Plot the results with stacked bar chart and data labels
plot1 <- ggplot(weekday_avg_duration_user, aes(x = day_of_week, y = avg_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Duration per Weekday by User Type",
x = "Day of Week",
y = "Average Duration (minutes)",
fill = "User Type") +
theme_minimal()
# Calculate total duration for each weekday and user type
weekday_total_duration_user <- merged_data %>%
group_by(day_of_week, member_casual) %>%
summarize(total_duration = sum(duration_minutes))
# Plot the results with stacked bar chart and data labels
plot2 <- ggplot(weekday_total_duration_user, aes(x = day_of_week, y = total_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Duration per Weekday by User Type",
x = "Day of Week",
y = "Total Duration (minutes)",
fill = "User Type") +
theme_minimal()
# Calculate average distance for each weekday and user type
weekday_avg_distance_user <- merged_data %>%
group_by(day_of_week, member_casual) %>%
summarize(avg_distance = mean(trip_length_km))
# Plot the results with a stacked bar chart and rounded data labels
plot3 <- ggplot(weekday_avg_distance_user, aes(x = day_of_week, y = avg_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Distance per Weekday by User Type",
x = "Day of Week",
y = "Average Distance (km)",
fill = "User Type") +
theme_minimal()
# Calculate total distance for each weekday and user type
weekday_total_distance_user <- merged_data %>%
group_by(day_of_week, member_casual) %>%
summarize(total_distance = sum(trip_length_km))
# Plot the results with a stacked bar chart and rounded data labels
plot4 <- ggplot(weekday_total_distance_user, aes(x = day_of_week, y = total_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Distance per Weekday by User Type",
x = "Day of Week",
y = "Total Distance (km)",
fill = "User Type") +
theme_minimal()
# Calculate trip count for each day of the week and user type
trip_count_per_day_user <- merged_data %>%
group_by(day_of_week, member_casual) %>%
summarize(trip_count = n())
# Plot the stacked bar chart with data labels
plot5 <- ggplot(trip_count_per_day_user, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Distribution of Trips Throughout the Week by User Type",
x = "Day of Week",
y = "Trip Count",
fill = "User Type") +
theme_minimal()
grid.arrange(plot1, plot2, plot3, plot4, plot5, nrow=3, ncol=2)
Insights
Average Duration: Casual users consistently exhibit higher average trip durations compared to annual users. Notably, casual users show a peak in average duration during weekends (32 minutes), while experiencing a decline in the middle of the week (24 minutes). Conversely, annual users demonstrate a similar pattern with slightly lower averages. This discrepancy may be attributed to weekend trips primarily catering to tourist cycling and leisure activities.
Average Distance: Both user types maintain a stable average distance throughout the week, approximately 2.06 km per trip.
Total Duration: On weekdays, the total duration of trips for both user types remains nearly identical. However, during weekends, casual users’ total duration significantly increases, doubling their weekday duration. In contrast, annual members’ total duration remains consistent even on weekends. This trend suggests that casual users engage in more cycling activities for leisure purposes on weekends, while annual users maintain a consistent routine of commuting trips.
Total Distance: Annual users tend to undertake longer trips on weekdays, supporting the notion that they primarily use bikes for commuting to work. Conversely, casual users record lower total distances with a slight increase on weekends, indicating that their trips are shorter in distance but longer in duration, aligning with leisurely activities.
Distribution of Trips: The distribution of trips reveals distinct patterns for each user type. While annual members account for the majority of trips, they tend to book more trips on weekdays (Tuesday to Thursday). In contrast, casual members book more trips on weekends. This observation supports the idea that annual members prioritize weekday commuting, while casual members prefer weekend recreational activities.
Q: How does each user type use bikes every day of the month (1 - 31)?
# Calculate average duration for each day and user type
daily_avg_duration_user <- merged_data %>%
group_by(day_of_month, member_casual) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Plot the results with a line chart
# (day_of_month is numeric, so matching it against weekday names like "Sat"
# can never succeed; the intended weekend reference lines were removed)
plot_1 <- ggplot(daily_avg_duration_user, aes(x = day_of_month, y = avg_duration, color = member_casual)) +
geom_line(size = 1.5) +
labs(title = "Average Duration per Day by User Type",
x = "Day of Month",
y = "Average Duration (minutes)",
color = "User Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 30, by = 3)) # Set breaks every 3 days
# Calculate total duration for each day and user type
daily_total_duration_user <- merged_data %>%
group_by(day_of_month, member_casual) %>%
summarize(total_duration = sum(duration_minutes))
# Plot the results with line chart
plot_2 <- ggplot(daily_total_duration_user, aes(x = day_of_month, y = total_duration, color = member_casual)) +
geom_line(size = 1.5) +
labs(title = "Total Duration per Day by User Type",
x = "Day of Month",
y = "Total Duration (minutes)",
color = "User Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 30, by = 3)) # Set breaks every 3 days
# Calculate average distance for each day and user type
daily_avg_distance_user <- merged_data %>%
group_by(day_of_month, member_casual) %>%
summarize(avg_distance = mean(trip_length_km, na.rm = TRUE))
# Plot the results with line chart
plot_3 <- ggplot(daily_avg_distance_user, aes(x = day_of_month, y = avg_distance, color = member_casual)) +
geom_line(size = 1.5) +
labs(title = "Average Distance per Day by User Type",
x = "Day of Month",
y = "Average Distance (km)",
color = "User Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 30, by = 3)) # Set breaks every 3 days
# Calculate total distance for each day and user type
daily_total_distance_user <- merged_data %>%
group_by(day_of_month, member_casual) %>%
summarize(total_distance = sum(trip_length_km))
# Plot the results with line chart
plot_4 <- ggplot(daily_total_distance_user, aes(x = day_of_month, y = total_distance, color = member_casual)) +
geom_line(size = 1.5) +
labs(title = "Total Distance per Day by User Type",
x = "Day of Month",
y = "Total Distance (km)",
color = "User Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 30, by = 3)) # Set breaks every 3 days
# Calculate trip count for each day of the month and user type
trip_count_per_day_user <- merged_data %>%
group_by(day_of_month, member_casual) %>%
summarize(trip_count = n())
# Plot line chart for each user type
plot_5 <- ggplot(trip_count_per_day_user, aes(x = day_of_month, y = trip_count, color = member_casual)) +
geom_line(size = 1.5) +
labs(title = "Distribution of Trips throughout the Month by User Type",
x = "Day of Month",
y = "Count of Trips",
color = "User Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 30, by = 3)) # Set breaks every 3 days
grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, nrow=5, ncol=1)
Insights:
All the factors share the same pattern of increases on certain days (3, 4, 10, 15, 22, 28). A possible reason is that these days often fall on a weekend, which would explain why total duration, total distance, and trip counts peak on them.
The following heat map checks whether there is a correlation between days of the month and weekends.
# Create a data frame with all combinations of day of month (1 to 31) and the two weekend days (Sat, Sun)
heatmap_data <- expand.grid(day_of_month = 1:31, day_of_week = c("Sat", "Sun"))
# Generate placeholder random values for the heatmap (replace these with actual counts from the trip data)
set.seed(123) # for reproducibility
heatmap_data$value <- sample(1:100, size = nrow(heatmap_data), replace = TRUE)
# Plot the heatmap
ggplot(heatmap_data, aes(x = day_of_month, y = factor(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")), fill = value)) +
geom_tile() +
geom_text(aes(label = value), color = "black") + # Add data labels
scale_fill_gradient(low = "white", high = "Red") +
labs(title = "Intersection of Day of Month and Day of Week",
x = "Day of Month",
y = "Day of Week") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Unfortunately, I can't see any consistent pattern by day of month, because the day numbers that fall on weekends change from month to month.
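One way to verify this directly, instead of using random placeholder values, is to count from a real calendar how often each day number lands on a weekend. The sketch below uses base R only and assumes calendar year 2023 as a stand-in 12-month window (the actual study period may differ):

```r
# Hypothetical 12-month window; adjust the dates to the actual study period.
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")

calendar <- data.frame(
  day_of_month = as.integer(format(dates, "%d")),
  # POSIXlt$wday: 0 = Sunday, 6 = Saturday (locale-independent)
  is_weekend   = as.POSIXlt(dates)$wday %in% c(0, 6)
)

# For each day of the month, count how many times it fell on a weekend
weekend_freq <- aggregate(is_weekend ~ day_of_month, data = calendar, FUN = sum)
weekend_freq
```

Since any given day number occurs on a weekend only about 2 out of every 7 months (roughly 3–4 times a year), and which numbers those are shifts each month, this is consistent with the absence of a stable day-of-month pattern.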
Q: How does each user type use bikes every month (1 - 12)?
# Calculate average duration for each month and user type
monthly_avg_duration_user <- merged_data %>%
group_by(month, member_casual) %>%
summarize(avg_duration = mean(duration_minutes, na.rm = TRUE))
# Define month boundaries between meteorological seasons: winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), fall (Sep-Nov)
season_boundaries <- c(0.5, 2.5, 5.5, 8.5, 11.5)
# Plot the results with grouped bar chart
plot_1 <- ggplot(monthly_avg_duration_user, aes(x = month, y = avg_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Trip Duration per Month by User Type",
x = "Month",
y = "Average Trip Duration (minutes)",
fill = "User Type") +
theme_minimal() +
geom_vline(xintercept = season_boundaries, linetype = "solid", color = "gray", size = 1.5) # Add vertical lines for season boundaries with a solid line and increased width
# Calculate total duration for each month and user type
monthly_total_duration_user <- merged_data %>%
group_by(month, member_casual) %>%
summarize(total_duration = sum(duration_minutes, na.rm = TRUE))
# Plot the results with grouped bar chart
plot_2 <- ggplot(monthly_total_duration_user, aes(x = month, y = total_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Duration per Month by User Type",
x = "Month",
y = "Total Duration (minutes)",
fill = "User Type") +
theme_minimal() +
geom_vline(xintercept = season_boundaries, linetype = "solid", color = "gray", size = 1.5) # Add vertical lines for season boundaries with a solid line and increased width
# Calculate average distance for each month and user type
monthly_avg_distance_user <- merged_data %>%
group_by(month, member_casual) %>%
summarize(avg_distance = mean(trip_length_km, na.rm = TRUE))
# Plot the results with grouped bar chart
plot_3 <- ggplot(monthly_avg_distance_user, aes(x = month, y = avg_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Distance per Month by User Type",
x = "Month",
y = "Average Distance (km)",
fill = "User Type") +
theme_minimal() +
geom_vline(xintercept = season_boundaries, linetype = "solid", color = "gray", size = 1.5) # Add vertical lines for season boundaries with a solid line and increased width
# Calculate total distance for each month and user type
monthly_total_distance_user <- merged_data %>%
group_by(month, member_casual) %>%
summarize(total_distance = sum(trip_length_km, na.rm = TRUE))
# Plot the results with grouped bar chart
plot_4 <- ggplot(monthly_total_distance_user, aes(x = month, y = total_distance, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Distance per Month by User Type",
x = "Month",
y = "Total Distance (km)",
fill = "User Type") +
theme_minimal() +
geom_vline(xintercept = season_boundaries, linetype = "solid", color = "gray", size = 1.5) # Add vertical lines for season boundaries with a solid line and increased width
# Summarize total trip count for each month and user type
monthly_trip_count_user <- merged_data %>%
group_by(month, member_casual) %>%
summarize(trip_count = n())
# Plot the results with grouped bar chart
plot_5 <- ggplot(monthly_trip_count_user, aes(x = month, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Distribution of Trips per Month by User Type",
x = "Month",
y = "Trip Count",
fill = "User Type") +
theme_minimal() +
geom_vline(xintercept = season_boundaries, linetype = "solid", color = "gray", size = 1.5) # Add vertical lines for season boundaries with a solid line and increased width
grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, nrow=3, ncol=2)
Insights
Average Duration: Casual users consistently exhibit higher average trip durations compared to annual users. Notably, casual users show a peak in average duration during weekends (32 minutes), while experiencing a decline in the middle of the week (24 minutes). Conversely, annual users demonstrate a similar pattern with slightly lower averages. This discrepancy may be attributed to weekend trips primarily catering to tourist cycling and leisure activities.
Average Distance: Both user types maintain a stable average distance throughout the week, approximately 2.06 km per trip.
Total Duration: On weekdays, the total duration of trips for both user types remains nearly identical. However, during weekends, casual users’ total duration significantly increases, doubling their weekday duration. In contrast, annual members’ total duration remains consistent even on weekends. This trend suggests that casual users engage in more cycling activities for leisure purposes on weekends, while annual users maintain a consistent routine of commuting trips.
Total Distance: Annual users tend to undertake longer trips on weekdays, supporting the notion that they primarily use bikes for commuting to work. Conversely, casual users record lower total distances with a slight increase on weekends, indicating that their trips are shorter in distance but longer in duration, aligning with leisurely activities.
Distribution of Trips: The distribution of trips reveals distinct patterns for each user type. While annual members account for the majority of trips, they tend to book more trips on weekdays (Tuesday to Thursday). In contrast, casual members book more trips on weekends. This observation supports the idea that annual members prioritize weekday commuting, while casual members prefer weekend recreational activities.